Abstract

In cluster analysis, it can be useful to interpret the obtained partition
with respect to external qualitative variables. An approach is proposed in
the model-based clustering context to select a model and a number of
clusters in order to get a partition which both provides a good fit with
the data and is well related to the external variables. This approach makes
use of the integrated joint likelihood of the data and the partitions at
hand, namely the model-based partition and the partitions associated with
the external variables. It is worth noticing that the known partitions are
only used to select a relevant mixture model. Each mixture model is fitted
from the data by maximum likelihood. Numerical experiments illustrate the
promising behaviour of the derived criterion.
1 Introduction



In model selection, assuming that the data arose from one of the models in 
competition is often somewhat unrealistic and could be misleading. However 
this assumption is implicitly made when using standard model selection 
criteria such as AIC or BIC. This "true model" assumption could lead to 
overestimating the model complexity in practical situations. On the other 
hand, a common feature of standard penalized likelihood criteria such as AIC 
and BIC is that they do not take into account the modelling purpose. Our
opinion is that taking the modelling purpose into account when selecting a
model leads to more flexible criteria favoring useful and parsimonious
models. This point of view could be exploited in many statistical learning
situations. Here, it is developed in a model-based clustering context to
choose a sensible partition of the data, possibly favoring partitions
leading to a relevant interpretation with respect to external qualitative
variables. The paper is
organised as follows. In Section 2, the framework of model-based clustering is 
described. Our new penalised likelihood criterion is presented in Section 3. 
Numerical experiments on simulated and real data sets are presented in 
Section 4 to illustrate the behavior of this criterion and highlight its possible 
interest. A short discussion section ends the paper. 

2 Model-based clustering 

Model-based clustering consists of assuming that the data set to be
classified arises from a mixture distribution, recovering that distribution
as well as possible, and associating each cluster with one of the mixture
components. Embedding cluster analysis in this precise framework is useful
in many respects. In particular, it allows one to choose the number K of
classes (i.e. the number of mixture components) in a principled way.

2.1 Finite mixture models 

Please refer to McLachlan and Peel (2000) for a comprehensive introduction 
to finite mixture models. 

Data to be classified $y \in \mathbb{R}^{n \times d}$ are assumed to arise
from a mixture

$$f(y_i \mid \theta_K) = \sum_{k=1}^{K} p_k \, \phi(y_i \mid a_k),$$

where the $p_k$'s are the mixing proportions and $\phi(\cdot \mid a_k)$
denotes the component probability density function (such as the
d-dimensional Gaussian density) with parameter $a_k$, and
$\theta_K = (p_1, \dots, p_{K-1}, a_1, \dots, a_K)$. The corresponding
parameter space is denoted by $\Theta_K$. A mixture model can be regarded
as a latent structure model involving unknown label data
$z = (z_1, \dots, z_n)$, which are binary vectors with $z_{ik} = 1$ if and
only if $y_i$ arises from component k. Those indicator vectors define a
partition $P = (P_1, \dots, P_K)$ of the data y with
$P_k = \{y_i \mid z_{ik} = 1\}$. Each model is usually fitted through
maximum likelihood estimation. The corresponding estimator, denoted from
now on by $\hat\theta_K$, is generally derived with the EM algorithm
(Dempster et al., 1977; McLachlan and Krishnan, 1997). From a density
estimation perspective, a classical way of choosing a mixture model is to
select the model maximising the integrated likelihood,

$$f(y \mid K) = \int_{\Theta_K} f(y \mid \theta_K) \, \pi(\theta_K) \, d\theta_K,
\qquad f(y \mid \theta_K) = \prod_{i=1}^{n} f(y_i \mid \theta_K),$$

$\pi(\theta_K)$ being a weakly informative prior distribution on
$\theta_K$. For n large enough, it can be approximated with the BIC
criterion (Schwarz, 1978)

$$\log f(y \mid K) \approx \log f(y \mid \hat\theta_K) - \frac{\nu_K}{2} \log n,$$

with $\hat\theta_K$ the maximum likelihood estimator and $\nu_K$ the number
of free parameters in the mixture model with K components. Numerical
experiments (see for instance Roeder and Wasserman, 1997) show that BIC
works well at a practical level for mixture models.
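
As a minimal sketch of how this approximation can be used, the Python snippet below evaluates BIC in the sign convention written above (larger is better) and picks the best K. The function names, the restriction to Gaussian components with full covariance matrices, and the log-likelihood values are assumptions made purely for illustration; they do not come from the experiments reported here.

```python
import math


def n_free_params_full_gaussian(K: int, d: int) -> int:
    """Free parameters of a K-component, d-dimensional Gaussian mixture
    with free proportions and full covariance matrices (assumed model form)."""
    proportions = K - 1                  # proportions sum to one
    means = K * d
    covariances = K * d * (d + 1) // 2   # symmetric d x d matrix per component
    return proportions + means + covariances


def bic(max_loglik: float, nu: int, n: int) -> float:
    """BIC as written above: maximised log-likelihood minus (nu / 2) * log(n)."""
    return max_loglik - 0.5 * nu * math.log(n)


# Hypothetical maximised log-likelihoods for K = 1, 2, 3 (illustrative only).
logliks = {1: -420.0, 2: -380.0, 3: -374.0}
n, d = 150, 2

scores = {K: bic(ll, n_free_params_full_gaussian(K, d), n)
          for K, ll in logliks.items()}
best_K = max(scores, key=scores.get)   # model with the largest BIC
```

With these made-up numbers the gain in log-likelihood from K = 2 to K = 3 does not compensate for the six additional parameters, so the two-component model is retained.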

2.2 Choosing K from the clustering view point 

In the model-based clustering context, an alternative to the BIC criterion 
is the ICL criterion (Biernacki et al., 2000) which aims at maximising the 
integrated likelihood of the complete data (y, z) 

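$$f(y, z \mid K) = \int_{\Theta_K} f(y, z \mid \theta_K) \, \pi(\theta_K) \, d\theta_K.$$

It can be approximated with a BIC-like approximation:

$$\log f(y, z \mid K) \approx \log f(y, z \mid \theta_K^*) - \frac{\nu_K}{2} \log n,
\qquad \theta_K^* = \arg\max_{\theta_K} f(y, z \mid \theta_K).$$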
But z and $\theta_K^*$ are unknown. Arguing that
$\hat\theta_K \approx \theta_K^*$ if the mixture components are well
separated for n large enough, Biernacki et al. (2000) replace $\theta_K^*$
by $\hat\theta_K$ and the missing data z with
$\hat z = \mathrm{MAP}(\hat\theta_K)$ defined by

$$\hat z_{ik} = \begin{cases} 1 & \text{if } \arg\max_{\ell} \tau_i^{\ell}(\hat\theta_K) = k, \\ 0 & \text{otherwise,} \end{cases}$$

$\tau_i^k(\hat\theta_K)$ denoting the conditional probability that $y_i$
arises from the kth mixture component ($1 \le i \le n$ and
$1 \le k \le K$):

$$\tau_i^k = \frac{p_k \, \phi(y_i \mid a_k)}{\sum_{\ell=1}^{K} p_\ell \, \phi(y_i \mid a_\ell)}. \tag{1}$$

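Finally the ICL criterion is

$$\mathrm{ICL}(K) = \log f(y, \hat z \mid K, \hat\theta_K) - \frac{\nu_K}{2} \log n. \tag{2}$$

Roughly speaking, ICL is the criterion BIC decreased by the estimated mean
entropy

$$E(K) = - \sum_{k=1}^{K} \sum_{i=1}^{n} \tau_i^k(\hat\theta_K) \log \tau_i^k(\hat\theta_K) \ge 0.$$

This is apparent if the estimated labels $\hat z$ are replaced in the
definition (2) by their respective conditional expectations
$\tau_i^k(\hat\theta_K)$.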

Because of this additional entropy term, ICL favors values of K giving 
rise to partitioning the data with the greatest evidence. The derivation and 
approximations leading to ICL are questioned in Baudry (2009, Chapter 4). 
However, in practice, ICL appears to provide a stable and reliable estimate
of K for real data sets and also for simulated data sets from the
clustering viewpoint. ICL, which does not aim at discovering the true
number of mixture components, can underestimate the number of components
for simulated data arising from mixtures with poorly separated components
(Biernacki et al., 2000). It concentrates on selecting a relevant number of
classes.

Note that, for a given number of components K and a parameter $\theta_K$,
the class of each observation $y_i$ is assigned according to the MAP rule
defined above.
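
The quantities of this section can be sketched in a few lines of Python. The snippet below computes the conditional probabilities of equation (1), the MAP labels, the entropy term, and the entropy-penalised form of ICL (BIC minus entropy) described above. The one-dimensional Gaussian components, the parameter values, the data points and all function names are assumptions chosen for illustration; in practice the parameters would be the maximum likelihood estimates returned by EM.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a one-dimensional Gaussian (assumed component family)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(y, props, means, sds):
    """tau[i][k]: conditional probability that y[i] arises from component k."""
    tau = []
    for yi in y:
        weighted = [p * gaussian_pdf(yi, m, s) for p, m, s in zip(props, means, sds)]
        total = sum(weighted)
        tau.append([w / total for w in weighted])
    return tau


def map_labels(tau):
    """MAP rule: each observation goes to its most probable component."""
    return [max(range(len(row)), key=row.__getitem__) for row in tau]


def entropy(tau):
    """-sum_i sum_k tau_ik * log(tau_ik), with the convention 0 * log 0 = 0."""
    return -sum(t * math.log(t) for row in tau for t in row if t > 0.0)


def icl_from_bic(bic_value, tau):
    """Entropy-penalised form of ICL: BIC decreased by the estimated entropy."""
    return bic_value - entropy(tau)


# Hypothetical data and parameters (illustrative only).
y = [-2.1, -1.9, 0.1, 2.0, 2.2]
tau = posteriors(y, props=[0.5, 0.5], means=[-2.0, 2.0], sds=[1.0, 1.0])
labels = map_labels(tau)
```

The observation at 0.1 lies between the two components, so its conditional probabilities are far from 0 and 1 and it contributes most of the entropy; well separated components would drive the entropy towards zero and ICL towards BIC.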

3 A particular clustering selection criterion 

Suppose that the problem is to classify observations described by vectors
y. In addition, a known classification u of the population, associated with
a qualitative variable not directly related to the variables defining the
vector y, is available. Relating the classification z and the
classification u could be of interest to get a suggestive and simple
interpretation of the classification z. With this purpose in mind, it is
possible to define a penalized likelihood criterion which selects a model
providing a good compromise between the mixture model fit and its ability
to lead to a clear classification of the observations well related to the
external classification u. Ideally, y and u should be conditionally
independent given z, as holds if u can be written as a function of z. Let
us consider the following heuristics. The problem is to find the mixture
model m maximising the integrated completed likelihood

$$p(y, u, z \mid m) = \int p(y, u, z \mid m, \theta_m) \, \pi(\theta_m) \, d\theta_m.$$

Note that, since a mixture model m is characterized not only by the number
of components K, but also by assumptions on the proportions and the
component variance matrices (see Celeux and Govaert, 1995), it is indexed
by m rather than K in the following.

Using a BIC-like approximation as in Biernacki et al. (2000), 

$$\log p(y, u, z \mid m) \approx \log p(y, u, z \mid m, \theta_m^*) - \frac{\nu_m}{2} \log n, \tag{3}$$

with

$$\theta_m^* = \arg\max_{\theta_m} p(y, u, z \mid m, \theta_m).$$

An approximation analogous to that leading to ICL is made: $\theta_m^*$ is
replaced by $\hat\theta_m$, the maximum likelihood estimator. The unknown
labels z are then replaced by the labels $\hat z$ deduced from the MAP rule
with this estimator. Assuming moreover that y and u are conditionally
independent given z, which should hold at least for mixtures with enough
components, it can be written

$$\begin{aligned}
\log p(y, u, z \mid m, \theta_m^*) &= \log p(y, u \mid z, m, \theta_m^*) + \log p(z \mid m, \theta_m^*) \\
&= \log p(y \mid z, m, \theta_m^*) + \log p(z \mid m, \theta_m^*) + \log p(u \mid z, m, \theta_m^*)
\end{aligned} \tag{4}$$

using the conditional independence assumption. From (3) this yields

$$\log p(y, u, z \mid m) \approx \log p(y, \hat z \mid m, \hat\theta_m) + \log p(u \mid \hat z, m, \hat\theta_m) - \frac{\nu_m}{2} \log n,$$

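and the estimation of $\log p(u \mid \hat z, m, \hat\theta_m)$ is derived
from the contingency table $(n_{k\ell})$ relating the qualitative variables
u and $\hat z$: for any $k \in \{1, \dots, K\}$ and
$\ell \in \{1, \dots, U_{\max}\}$, $U_{\max}$ being the number of levels of
the variable u,

$$n_{k\ell} = \mathrm{card}\{i : \hat z_{ik} = 1 \text{ and } u_i = \ell\}.$$

$\log p(u \mid \hat z, m, \hat\theta_m)$ is then estimated by

$$\sum_{\ell=1}^{U_{\max}} \sum_{k=1}^{K} n_{k\ell} \log \frac{n_{k\ell}}{n_{k\cdot}},$$

where $n_{k\cdot} = \sum_{\ell=1}^{U_{\max}} n_{k\ell}$.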

Finally, this leads to the Supervised Integrated Completed Likelihood 
(SICL) criterion 

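$$\mathrm{SICL}(m) = \mathrm{ICL}(m) + \sum_{\ell=1}^{U_{\max}} \sum_{k=1}^{K} n_{k\ell} \log \frac{n_{k\ell}}{n_{k\cdot}}.$$

The last additional term
$\sum_{\ell=1}^{U_{\max}} \sum_{k=1}^{K} n_{k\ell} \log \frac{n_{k\ell}}{n_{k\cdot}}$
quantifies the strength of the link between the qualitative variables u
and z.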
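
To make the behaviour of this additional term concrete, here is a minimal Python sketch that evaluates it from a contingency table whose rows are clusters and whose columns are levels of u. The function names and the two example tables are hypothetical and serve only as an illustration.

```python
import math


def sicl_link_term(table):
    """Sum over clusters k and levels l of n_kl * log(n_kl / n_k.).

    table[k][l] is the number of observations in cluster k having level l
    of the external variable; empty cells contribute nothing.
    """
    total = 0.0
    for row in table:
        n_k = sum(row)   # cluster size n_k.
        total += sum(n * math.log(n / n_k) for n in row if n > 0)
    return total


def sicl(icl_value, table):
    """SICL = ICL + link term."""
    return icl_value + sicl_link_term(table)


# Hypothetical tables (rows = clusters, columns = levels of u).
pure = [[50, 0, 0], [0, 50, 0], [0, 0, 50]]   # u is a function of z
mixed = [[25, 25], [25, 25]]                   # u unrelated to z

term_pure = sicl_link_term(pure)
term_mixed = sicl_link_term(mixed)
```

Since every ratio $n_{k\ell}/n_{k\cdot}$ is at most one, the term is never positive. It equals zero exactly when each cluster contains a single level of u, and becomes more negative as the levels of u are mixed within clusters, which is how SICL penalises partitions poorly related to u.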



Taking several external variables into account. The same kind of
derivation yields a criterion that takes into account several external
variables $u^1, \dots, u^r$. Suppose that $y, u^1, \dots, u^r$ are
conditionally independent given z. Then (4) becomes

$$\begin{aligned}
\log p(y, u^1, \dots, u^r, z \mid m, \theta_m^*) = {} & \log p(y \mid z, m, \theta_m^*) + \log p(z \mid m, \theta_m^*) \\
& + \log p(u^1 \mid z, m, \theta_m^*) + \dots + \log p(u^r \mid z, m, \theta_m^*),
\end{aligned} \tag{5}$$

with $\theta_m^* = \arg\max_{\theta_m} p(y, u^1, \dots, u^r, z \mid m, \theta_m)$.
As before, we assume that $\hat\theta_m \approx \theta_m^*$ and apply the
BIC-like approximation. Finally,

$$\begin{aligned}
\log p(y, u^1, \dots, u^r, z \mid m) \approx {} & \log p(y, \hat z \mid m, \hat\theta_m) \\
& + \log p(u^1 \mid \hat z, m, \hat\theta_m) + \dots + \log p(u^r \mid \hat z, m, \hat\theta_m) \\
& - \frac{\nu_m}{2} \log n,
\end{aligned}$$

and as before, the estimation of $\log p(u^j \mid \hat z, m, \hat\theta_m)$
is derived from the contingency table $(n^j_{k\ell})$ relating the
qualitative variables $u^j$ and $\hat z$: for any $k \in \{1, \dots, K\}$
and $\ell \in \{1, \dots, U^j_{\max}\}$, $U^j_{\max}$ being the number of
levels of the variable $u^j$,

$$n^j_{k\ell} = \mathrm{card}\{i : \hat z_{ik} = 1 \text{ and } u^j_i = \ell\}.$$

Finally, with $n_{k\cdot} = \sum_{\ell=1}^{U^j_{\max}} n^j_{k\ell}$ for any j
and any k (this does not depend on j), we get the multiple external
variables criterion "multi-SICL":

$$\mathrm{SICL}(m) = \mathrm{ICL}(m) + \sum_{j=1}^{r} \sum_{\ell=1}^{U^j_{\max}} \sum_{k=1}^{K} n^j_{k\ell} \log \frac{n^j_{k\ell}}{n_{k\cdot}}.$$
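
The multi-SICL criterion can be sketched directly from label vectors, building one contingency table per external variable and summing the link terms. The function names and the toy labels below are hypothetical, and the ICL value is a placeholder rather than a fitted quantity.

```python
import math
from collections import Counter


def link_term(z, u):
    """sum_{k,l} n_kl * log(n_kl / n_k.) from cluster labels z and external labels u."""
    n_kl = Counter(zip(z, u))   # contingency counts between z and u
    n_k = Counter(z)            # cluster sizes, the same for every external variable
    return sum(n * math.log(n / n_k[k]) for (k, _), n in n_kl.items())


def multi_sicl(icl_value, z, external_variables):
    """ICL plus one link term per external variable u^1, ..., u^r."""
    return icl_value + sum(link_term(z, u) for u in external_variables)


# Hypothetical MAP labels and two external qualitative variables.
z = [0, 0, 0, 1, 1, 1]
u1 = ["a", "a", "a", "b", "b", "b"]   # fully determined by z: no penalty
u2 = ["x", "y", "x", "y", "x", "y"]   # mixed within each cluster: negative term

value = multi_sicl(-100.0, z, [u1, u2])
```

Because the cluster sizes do not depend on j, each variable contributes independently, so adding a variable unrelated to the partition lowers the criterion while a variable that is a function of z leaves it unchanged.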



4 Numerical experiments 

We first present two simple applications to show that the SICL criterion
behaves as expected. The first example is an application to the Iris data
set (Fisher, 1936), which consists of 150 observations of four measurements
(y) for three species of Iris (u). Those data are depicted in Figure 1 and
the variations of the criteria BIC, ICL and SICL as a function of K are
provided in Figure 2. While BIC and ICL choose two classes,
SICL selects the three-component mixture solution which is closely related 
to the species of Iris, as attested by the contingency table between the two 
partitions (Table 1). 

Figure 1: Iris data set

Figure 2: Information criteria for the Iris data set

Table 1: Iris data. Contingency table between the "species" variable and
the classes derived from the three-component mixture.

Species \ k      1     2     3
Setosa           0    50     0
Versicolor      45     0     5
Virginica        0     0    50

For the second experiment, we simulated 200 observations from a Gaussian
mixture in $\mathbb{R}^2$ depicted in Figure 3 and the variable u
corresponds exactly to the mixture component from which each observation
arises. Diagonal mixture models (i.e. with diagonal variance matrices) are
fitted. The variations of the criteria BIC, ICL and SICL as a function of K
are provided in Figure 4. We repeated this experiment with 100 different
simulated data sets. BIC almost always recovers the four Gaussian
components, while ICL
almost always selects three because of the two heavily overlapping ones
(the "cross"). Since the solution obtained through MLE with the
four-component mixture model yields classes nicely related to the
considered u classes, SICL favors the four-component solution more than ICL
does. But since it also takes the overlap into account, it still selects
the three-component model about half of the time (56 times out of 100 in
our experiments), and selects the four-component model in almost all the
remaining cases (40 out of 100).
Actually, as illustrated in Figure 4 for a given data set, SICL hesitates be- 
tween three and four clusters. In this case, this suggests considering both 
solutions. 




Figure 3: "Cross" data set

Figure 4: Information criteria for the "Cross" data set

In the next two experiments, we illustrate that SICL does not interfere
with the model selection when u cannot be related to the mixture
distributions at hand. First, we consider a situation where u is a
two-class partition which has no link at all with four-component mixture
data. In Figure 5 the classes of u are in red and in blue. As is apparent
from Figure 6, SICL does not change the solution K = 4 provided by BIC and
ICL.

Then we consider a two-component mixture and a two-class u partition
"orthogonal" to this mixture. In Figure 7 the classes of u are in red and
in blue. As is apparent from Figure 8, SICL does not change the solution
K = 2 provided by BIC and ICL, even though this solution has no link at all
with the u classes.
Figure 5: Simulated data set

Figure 6: Information criteria for this simulated data set

Figure 7: Simulated data set

Figure 8: Information criteria for this simulated data set
4.1 Real data set: wholesale customers



The segmentation of the customers of a wholesale distributor is performed
to illustrate the performance of the SICL criterion. The data set refers to
440 customers of a wholesaler: 298 from the Horeca (Hotel/Restaurant/Cafe)
channel and 142 from the Retail channel. They are distributed across two
large Portuguese city regions (Lisbon and Oporto) and a complementary
region.



Table 2: Distribution of the Region variable

Region          Frequency   Percentage
Lisbon                 77         17.5
Oporto                 47         10.7
Other region          316         71.8
Total                 440        100


The wholesale data include the customers' annual spending in monetary units
(m.u.) on six product categories: fresh products, milk products, grocery,
frozen products, detergents and paper products, and delicatessen. These
variables are summarized in Table 3.



Table 3: Product categories sales (m.u.).

Product category          Mean   Std. Deviation
Fresh products           12000            12647
Milk products             5796             5796
Grocery                   7951             9503
Frozen                    3072             4855
Detergents and Paper      2881             4768
Delicatessen              1525             2820


The data also include responses to a questionnaire intended to evaluate
possible managerial actions with a potential impact on sales, such as
improving the store layout, offering discount tickets or extending the
product assortment. The customers were asked whether the action in question
would have an impact on their purchases at the wholesaler, and their
answers were recorded on the scale: 1-Certainly no; 2-Probably no;
3-Probably yes; 4-Certainly yes. A Gaussian mixture model has been fitted
on the continuous variables described in Table 3 with Rmixmod (Lebret et
al., 2012). The results are presented in Figure 9.




Figure 9: Information criteria for the wholesale dataset 

The SICL values based on the Channel variable, on the Region variable, and
on both Channel and Region are denoted by SICL1, SICL2 and SICL12
respectively. BIC and ICL select a nine-cluster solution of little
practical use, with no clear interpretation. SICL1 selects a four-cluster
solution, SICL2 a five-cluster solution and SICL12 a three-cluster
solution.

The five-cluster solution is less usable than the alternatives (see
Figure 10). Figure 11 highlights the link between the four-cluster solution
and the Channel external variable. The spending patterns across product
categories associated with each cluster are displayed in Figure 12.
Cluster 3 is small but includes customers who spend a lot and tend to be
particularly sensitive to the potential extension of the product assortment
(see Figure 14).

SICL12 provides the most clear-cut selection (see Figure 9) and the most
parsimonious solution. As a matter of fact, this three-cluster solution is
well linked with the external variables (see Figures 15 and 16) while the
clusters remain easily discriminated by the spending on the product
categories: in particular, cluster 2 (resp. 3) includes a majority of
Horeca (resp. Retail) customers buying a lot of fresh products (resp.
grocery) (see Figure 13). Cluster 3 is slightly more sensitive to the
offering of discount tickets while cluster 2 is slightly more prone to
react to improvement of the store layout (see Figure 17).
Figure 10: Distribution of the variable Region on the SICL2 solution

Figure 11: Distribution of the variable Channel on the SICL1 solution

Figure 12: Distribution of the product categories on the SICL1 solution

Figure 13: Distribution of the product categories on the SICL12 solution
Figure 14: SICL1 solution and managerial actions (panel: "Will buy more if:
the Wholesale extends products assortment")

Figure 15: Distribution of the Channel variable on the SICL12 solution

Figure 16: Distribution of the Region variable on the SICL12 solution
Figure 17: SICL12 solution and managerial actions (panels: "Will buy more
if: Wholesale offers discount tickets" and "Will buy more if: Wholesale
improves the store layout")



5 Discussion 

The criterion SICL has been conceived in the model-based clustering context
to choose a sensible number of classes possibly well related to an external
qualitative variable or a set of external qualitative variables of interest
(variables other than the variables on which the clustering is based). This
criterion can be useful to draw attention to a well-grounded classification
related to these external qualitative variables. It is an example of a
model selection criterion taking into account the modeller's purpose to
choose a useful and stable model. From our experience, in many situations,
SICL selects the same models as the criteria ICL or BIC. But when SICL
provides a different answer from ICL or BIC, it could shed light on a quite
interesting clustering, as illustrated in the numerical experiments. It
seems that SICL could be expected to select a different partition from ICL
particularly when several external variables are considered. Thus, SICL
could highlight partitions of special interest with respect to external
qualitative variables. Therefore, we think that SICL deserves a place in
the toolkit of model selection criteria for clustering. In most cases, it
will propose a sensible solution, and when it points out an original
solution, it could be of great interest for practical purposes.